Abstract: We study the problem of load balancing in distributed
stream processing engines, which is exacerbated in the
presence of skew. We introduce Partial Key Grouping (PKG),
a new stream partitioning scheme that adapts the classical “power 
of two choices” to a distributed streaming setting by leveraging 
two novel techniques: key splitting and local load estimation. In so 
doing, it achieves better load balancing than key grouping while 
being more scalable than shuffle grouping. 

We test PKG on several large datasets, both real-world and 
synthetic. Compared to standard hashing, PKG reduces the load 
imbalance by up to several orders of magnitude, and often 
achieves nearly-perfect load balance. This result translates into 
an improvement of up to 60% in throughput and up to 45% in 
latency when deployed on a real Storm cluster. 

I. Introduction 

Distributed stream processing engines (DSPEs) such as S4,¹
Storm,² and Samza³ have recently gained much attention owing
to their ability to process huge volumes of data with very
low latency on clusters of commodity hardware. Streaming 
applications are represented by directed acyclic graphs (DAG) 
where vertices, called processing elements (PEs), represent 
operators, and edges, called streams, represent the data flow
from one PE to the next. For scalability, streams are partitioned 
into sub-streams and processed in parallel on a replica of the 
PE called processing element instance (PEI). 

Applications of DSPEs, especially in data mining and machine
learning, typically require accumulating state across the
stream by grouping the data on common fields. Akin to
MapReduce, this grouping in DSPEs is usually implemented 
by partitioning the stream on a key and ensuring that messages 
with the same key are processed by the same PEI. This 
partitioning scheme is called key grouping. Typically, it maps 
keys to sub-streams by using a hash function. Hash-based 
routing allows each source PEI to route each message solely 
via its key, without needing to keep any state or to coordinate 
among PEIs. Alas, it also results in load imbalance as it
represents a “single-choice” paradigm, and because it
disregards the popularity of a key, i.e., the number of messages
with the same key in the stream, as depicted in Figure 1.

1 https://incubator.apache.org/s4 

2 https://storm.incubator.apache.org

3 https://samza.incubator.apache.org



Fig. 1: Load imbalance generated by skew in the key distribution
when using key grouping. The color of each message
represents its key.


Large web companies run massive deployments of DSPEs 
in production. Given their scale, good utilization of the
resources is critical. However, the skewed distribution of many
workloads causes a few PEIs to sustain a significantly higher
load than others. This suboptimal load balancing leads to poor 
resource utilization and inefficiency. 

Another partitioning scheme called shuffle grouping 
achieves excellent load balancing by using a round-robin 
routing, i.e., by sending a message to a new PEI in cyclic 
order, irrespective of its key. However, this scheme is mostly 
suited for stateless computations. Shuffle grouping may require 
an additional aggregation phase and more memory to express 
stateful computations (Section II). Additionally, it may cause a
decrease in accuracy for data mining algorithms (Section VI).

In this work, we focus on the problem of load balancing of 
stateful applications in DSPEs when the input stream follows 
a skewed key distribution. In this setting, load balancing is 
attained by having upstream PEIs create a balanced partition
of messages for downstream PEIs, for each edge of the DAG.
Any practical solution for this task needs to be both streaming 
and distributed: the former constraint enforces the use of an 
online algorithm, as the distribution of keys is not known in 
advance, while the latter calls for a decentralized solution with 
minimal coordination overhead in order to ensure scalability. 



To address this problem, we leverage the “power of two
choices” (PoTC), whereby the system picks the least loaded
out of two candidate PEIs for each key. However, to maintain
the semantics of key grouping while using PoTC (i.e., so that 
one key is handled by a single PEI), sources would need to 
track which of the two possible choices has been made for 
each key. This requirement imposes a coordination overhead 
every time a new key appears, so that all sources agree on the 
choice. In addition, sources should then store this choice in 
a routing table. Each edge in the DAG would thus require a 
routing table for every source, each with one entry per key. 
Given that a typical stream may contain billions of keys, this 
solution is not practical. 

Instead, we propose to relax the key grouping constraint 
and allow each key to be handled by both candidate PEIs. We
call this technique key splitting; it allows us to apply PoTC
without the need to agree on, or keep track of, the choices
made. As shown in Section V, key splitting guarantees good
load balance even in the presence of skew. 

A second issue is how to estimate the load of a downstream 
PEI. Traditional work on PoTC assumes global knowledge of 
the current load of each server, which is challenging in a 
distributed system. Additionally, it assumes that all messages 
originate from a single source, whereas messages in a DSPE 
are generated in parallel by multiple sources. 

In this paper we prove that, interestingly, a simple local 
load estimation technique, whereby each source independently 
tracks the load of downstream PEIs, performs very well in
practice. This technique gives results that are almost
indistinguishable from those given by a global load oracle.

The combination of these two techniques (key splitting 
and local load estimation) enables a new stream partitioning 
scheme named Partial Key Grouping. 

In summary, we make the following contributions. 

• We study the problem of load balancing in modern
distributed stream processing engines.

• We show how to apply PoTC to DSPEs in a principled and 
practical way, and propose two novel techniques to do so: 
key splitting and local load estimation. 

• We propose Partial Key Grouping, a novel and simple 
stream partitioning scheme that applies to any DSPE. When 
implemented on top of Apache Storm, it requires a single 
function and fewer than 20 lines of code.⁴

• We measure the impact of PKG on a real deployment on 
Apache Storm. Compared to key grouping, it improves 
the throughput of an example application on real-world 
datasets by up to 60%, and the latency by up to 45%. 

II. Preliminaries and Motivation 

We consider a DSPE running on a cluster of machines that 
communicate by exchanging messages following the flow of 
a DAG, as discussed. In this work, we focus on balancing the 
data transmission along a single edge in a DAG. Load balancing 
across the whole DAG is achieved by balancing along each
edge independently. Each edge represents a single stream of
data, along with its partitioning scheme. Given a stream under
consideration, let the set of upstream PEIs (sources) be $\mathcal{S}$, and
the set of downstream PEIs (workers) be $\mathcal{W}$, and their sizes
be $|\mathcal{S}| = S$ and $|\mathcal{W}| = W$ (see Figure 1).

4 Available at https://github.com/gdfm/partial-key-grouping

The input to the engine is a sequence of messages m =
(t, k, v) where t is the timestamp at which the message is
received, $k \in \mathcal{K}$, $|\mathcal{K}| = K$, is the message key, and v is the
value. The messages are presented to the engine in ascending 
order by timestamp. 

A stream partitioning function $P_t : \mathcal{K} \to \mathbb{N}$ maps each key
in the key space to a natural number, at a given time t. This
number identifies the worker responsible for processing the 
message. Each worker is associated to one or more keys. 

We use a definition of load similar to others in the literature 
(e.g., Flux). At time t, the load of a worker i is the number
of messages handled by the worker up to t:

$$L_i(t) = |\{(\tau, k, v) : P_\tau(k) = i \wedge \tau \le t\}|$$

In principle, depending on the application, two different 
messages might impose a different load on workers. However, 
in most cases these differences even out and modeling such 
application-specific differences is not necessary. 

We define imbalance at time t as the difference between the 
maximum and the average load of the workers: 

$$I(t) = \max_i(L_i(t)) - \operatorname{avg}_i(L_i(t)), \quad \text{for } i \in \mathcal{W}$$

We tackle the problem of identifying a stream partitioning 
function that minimizes the imbalance, while at the same time 
avoiding the downsides of shuffle grouping. 
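
The definitions of load and imbalance can be made concrete with a small sketch. The helper names `loads` and `imbalance` are illustrative and not part of the paper; the sketch assumes every message costs one unit of load, as in the definition above.

```python
from collections import Counter


def loads(assignments, num_workers):
    """Return the load L_i of each worker, given the worker chosen for each message."""
    counts = Counter(assignments)
    return [counts.get(i, 0) for i in range(num_workers)]


def imbalance(worker_loads):
    """I(t) = max_i L_i(t) - avg_i L_i(t)."""
    return max(worker_loads) - sum(worker_loads) / len(worker_loads)


# Six messages routed to three workers; worker 0 receives four of them.
example = loads([0, 0, 0, 0, 1, 2], num_workers=3)
print(example, imbalance(example))
```

A perfectly balanced assignment has imbalance zero, while routing every message to one worker gives the largest possible value for a given stream length.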

A. Existing Stream Partitioning Functions 

Data is sent between PEs by exchanging messages over 
the network. Several primitives are offered by DSPEs for 
sources to partition the stream, i.e., to route messages to 
different workers. There are two main primitives of interest: 
key grouping (KG) and shuffle grouping (SG). 

KG ensures that messages with the same key are handled 
by the same PEI (analogous to MapReduce). It is usually 
implemented through hashing. 

SG routes messages independently, typically in a round-robin
fashion. SG provides excellent load balance by assigning
an almost equal number of messages to each PEI. However,
no guarantee is made on the partitioning of the key space, as
each occurrence of a key can be assigned to any PEI. SG is the
perfect choice for stateless operators. However, with stateful 
operators one has to handle, store and aggregate multiple 
partial results for the same key, thus incurring additional costs. 

In general, when the distribution of input keys is skewed, 
the number of messages that each PEI needs to handle can 
vary greatly. While this problem is not present for stateless 
operators, which can use SG to evenly distribute messages, 
stateful operators implemented via KG suffer from load imbalance.
This issue degrades the service level, or reduces the
utilization of the cluster, which must be provisioned
to handle the peak load of the single most loaded server.
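
The contrast between the two primitives can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the Zipf-like workload, the MD5-based `stable_hash`, and all function names are hypothetical choices made for illustration, not the implementation used by any DSPE.

```python
import hashlib
import random
from collections import Counter


def stable_hash(key, seed=0):
    """Deterministic hash of a key (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def key_grouping(keys, num_workers):
    """KG: every message with the same key goes to the same worker."""
    return [stable_hash(k) % num_workers for k in keys]


def shuffle_grouping(keys, num_workers):
    """SG: round-robin over the workers, ignoring the key."""
    return [i % num_workers for i, _ in enumerate(keys)]


def imbalance(assignments, num_workers):
    """Maximum worker load minus average worker load."""
    counts = Counter(assignments)
    worker_loads = [counts.get(w, 0) for w in range(num_workers)]
    return max(worker_loads) - sum(worker_loads) / num_workers


def zipf_stream(num_keys, num_messages, exponent=1.0, seed=42):
    """Assumed skewed workload: the key of rank r has weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** exponent for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=num_messages)


W = 10
stream = zipf_stream(num_keys=1000, num_messages=100_000)
kg_imbalance = imbalance(key_grouping(stream, W), W)
sg_imbalance = imbalance(shuffle_grouping(stream, W), W)
print(f"KG imbalance: {kg_imbalance:.0f}, SG imbalance: {sg_imbalance:.0f}")
```

Under this assumed workload the most frequent key alone accounts for more than the average share of a worker, so KG cannot be balanced regardless of the hash function, whereas SG stays within one message of perfect balance.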



Example. To make the discussion more concrete, we introduce
a simple application that will be our running example: streaming
top-k word count. This application is an adaptation of the
classical MapReduce word count to the streaming paradigm 
where we want to generate a list of top-k words by frequency 
at periodic intervals (e.g., each T seconds). It is also a common 
application in many domains, for example to identify trending 
topics in a stream of tweets. 

Implementation via key grouping. Following the MapReduce
paradigm, the implementation of word count described
by Neumeyer et al. or Noll uses KG on the source
stream. The counter PE keeps a running counter for each
word. KG ensures that each word is handled by a single PEI,
which thus has the total count for the word in the stream. At
periodic intervals, the counter PEIs send their top-k counters to
a single downstream aggregator to compute the top-k words.
While this application is clearly simplistic, it models quite well 
a general class of applications common in data mining and 
machine learning whose goal is to create a model by tracking 
aggregated statistics of the data. 

Clearly KG generates load imbalance as, for instance, the
PEI associated with the key “the” will receive many more
messages than the one associated with “Barcelona”. This example
captures the core of the problem we tackle: the distribution
of word frequencies follows a Zipf law where few words are
extremely common while a large majority are rare. Therefore,
an even distribution of keys such as the one generated by KG
results in an uneven distribution of messages.

Implementation via shuffle grouping. An alternative
implementation uses shuffle grouping on the source stream to get
partial word counts. These counts are sent downstream to an 
aggregator every T seconds via key grouping. The aggregator 
simply combines the counts for each key to get the total count 
and selects the top-k for the final result. 

Using SG requires a slightly more complex logic but it 
generates an even distribution of messages among the counter
PEIs. However, it suffers from other problems. Given that
there is no guarantee which PEI will handle a key, each PEI 
potentially needs to keep a counter for every key in the stream. 
Therefore, the memory usage of the application grows linearly 
with the parallelism level. Hence, it is not possible to scale to 
a larger workload by adding more machines: the application 
is not scalable in terms of memory. Even if we resort to 
approximation algorithms, in general, the error depends on the 
number of aggregations performed, thus it grows linearly with 
the parallelism level. We analyze this case in further detail
along with other application scenarios in Section VI.

B. Key grouping with rebalancing 

One common solution for load balancing in DSPEs is
operator migration. Once a situation of
load imbalance is detected, the system activates a rebalancing
routine that moves part of the keys, and the state associated
with them, away from an overloaded server. While this solution
is easy to understand, its application in our context is not
straightforward for several reasons. 


Rebalancing requires setting a number of parameters such 
as how often to check for imbalance and how often to 
rebalance. These parameters are often application-specific as 
they involve a trade-off between imbalance and rebalancing 
cost that depends on the size of the state to migrate. 

Further, implementing a rebalancing mechanism usually 
requires major modifications of the DSPE at hand. This task 
may be hard, and is usually seen with suspicion by the 
community driving open source projects, as witnessed by the 
many variants of Hadoop that were never merged back into
the main line of development.

In our context, rebalancing implies migrating keys from one 
sub-stream to another. However, this migration is not directly 
supported by the programming abstractions of some DSPEs. 
Storm and Samza use a coarse-grained stream partitioning 
paradigm. Each stream is partitioned into as many sub-streams
as the number of downstream PEIs. Key migration is not
compatible with this partitioning paradigm, as a key cannot 
be uncoupled from its sub-stream. In contrast, S4 employs a 
fine-grained paradigm where the stream is partitioned into one 
sub-stream per key value, and there is a one-to-one mapping of 
a key to a PEI. The latter paradigm easily supports migration, 
as each key is processed independently. 

A major problem with mapping keys to PEIs explicitly is
that the DSPE must maintain several routing tables: one for 
each stream. Each routing table has one entry for each key 
in the stream. Keeping these tables is impractical because the 
memory requirements are staggering. In a typical web mining 
application, each routing table can easily have billions of keys. 
For a moderately large DAG with tens of edges, each with tens 
of sources, the memory overhead easily becomes prohibitive. 

Finally, as already mentioned, for each stream there are 
several sources sending messages in parallel. Modifications to 
the routing table must be consistent across all sources, so they 
require coordination, which creates further overhead. For these 
reasons we consider an alternative approach to load balancing. 

III. Partial Key Grouping 

The problem described so far currently lacks a satisfying 
solution. To solve this issue, we resort to a widely-used 
technique in the literature of load balancing: the so-called 
“power of two choices” (PoTC). While this technique is
well-known and has been analyzed thoroughly both from a
theoretical and practical perspective, its
application in the context of DSPEs is not straightforward and
has not been previously studied.

Introduced by Azar et al., PoTC is a simple and elegant
technique that achieves load balance when assigning
units of load to workers. It is best described in terms of
“balls and bins”. Imagine a process where a stream of balls 
(units of work) is distributed to a set of bins (the workers) as 
evenly as possible. The single-choice paradigm corresponds to 
putting each ball into one bin selected uniformly at random. By 
contrast, the power of two choices selects two bins uniformly 
at random, and puts the ball into the least loaded one. This
simple modification of the algorithm has powerful implications
that are well known in the literature (see Sections IV and VII).

Single choice. The current solution used by all DSPEs to
partition a stream with key grouping corresponds to the
single-choice paradigm. The system has access to a single hash
function $\mathcal{H}_1(k)$. The partitioning of keys into sub-streams is
determined by the function $P_t(k) = \mathcal{H}_1(k) \bmod W$, where
mod is the modulo operator.

The single-choice paradigm is attractive because of its
simplicity: the routing does not require maintaining any state and
can be done independently in parallel. However, it suffers from
a problem of load imbalance. This problem is exacerbated
when the distribution of input keys is skewed. 

PoTC. When using the power of two choices, we have
two hash functions $\mathcal{H}_1(k)$ and $\mathcal{H}_2(k)$. The algorithm
maps each key to the sub-stream assigned to the least
loaded worker between the two possible choices, that is:
$P_t(k) = \arg\min_i (L_i(t) : \mathcal{H}_1(k) = i \vee \mathcal{H}_2(k) = i)$.

The theoretical gain in load balance with two choices is 
exponential compared to a single choice. However, using
more than two choices only brings constant factor improvements.
Therefore, we restrict our study to two choices.
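
The balls-and-bins intuition can be checked with a short simulation. The sketch below is an illustration only: `throw_balls` is a hypothetical helper, every ball is distinct (no repeated keys), and the choices are drawn with a pseudo-random generator rather than hash functions.

```python
import random


def throw_balls(num_bins, num_balls, choices, seed=0):
    """Place each ball in the least loaded of `choices` bins sampled uniformly at random."""
    rng = random.Random(seed)
    bin_loads = [0] * num_bins
    for _ in range(num_balls):
        candidates = [rng.randrange(num_bins) for _ in range(choices)]
        target = min(candidates, key=bin_loads.__getitem__)
        bin_loads[target] += 1
    return bin_loads


n = 10_000
one_choice_max = max(throw_balls(n, n, choices=1))
two_choice_max = max(throw_balls(n, n, choices=2))
print(f"max load with one choice: {one_choice_max}, with two choices: {two_choice_max}")
```

With as many balls as bins, the average load is one, so the printed maxima directly show how far the fullest bin is from the average under each paradigm.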

PoTC introduces two additional complications. First, to 
maintain the semantics of key grouping, the system needs to 
keep state and track the choices made. Second, the system has 
to know the load of the workers in order to make the right 
choice. We discuss these two issues next. 

A. Key Splitting 

A naive application of PoTC to key grouping requires the 
system to store a bit of information for each key seen, to keep 
track of which of the two choices needs to be used thereafter. 
This variant is referred to as static PoTC. 

Static PoTC incurs some of the problems discussed for key 
grouping with rebalancing. Since the actual worker to which a 
key is routed is determined dynamically, sources need to keep 
a routing table with an entry per key. As already discussed, 
maintaining this routing table is often impractical. 

In order to leverage PoTC and make it viable for DSPEs, we 
relax the requirement of key grouping. Rather than mapping 
each key to one of the two possible choices, we allow it to be 
mapped to both choices. Every time a source sends a message, 
it selects the worker with the lowest current load among the 
two candidates associated to that key. This technique, called 
key splitting, introduces several new trade-offs. 

First, key splitting allows the system to operate in a decentralized
manner, by allowing multiple sources to take decisions
independently in parallel. As in key grouping and shuffle 
grouping, no state needs to be kept by the system and each 
message can be routed independently. 

Second, key splitting enables far better load balancing compared to
key grouping. It allows using PoTC to balance the load on the
workers: by splitting each key on multiple workers, it handles
the skew in the key popularity. Moreover, given that all its
decisions are dynamic and based on the current load of the
system (as opposed to static PoTC), key splitting adapts to
changes in the popularity of keys over time.

Third, key splitting reduces the memory usage and aggregation
overhead compared to shuffle grouping. Given that each
key is assigned to exactly two PEIs, the memory to store its
state is just a constant factor higher than when using key
grouping. Instead, with shuffle grouping the memory grows
linearly with the number of workers W. Additionally, state
aggregation needs to happen only once for the two partial
states, as opposed to W - 1 times in shuffle grouping. This
improvement also reduces the error incurred during
aggregation for some algorithms, as discussed in Section VI.


From the point of view of the application developer, key 
splitting gives rise to a novel stream partitioning scheme 
called Partial Key Grouping, which lies in-between key 
grouping and shuffle grouping. 

Naturally, not all algorithms can be expressed via PKG. 
The functions that can leverage PKG are the same ones 
that can leverage a combiner in MapReduce, i.e., associative 
functions and monoids. Examples of applications include naive 
Bayes, heavy hitters, and streaming parallel decision trees, as 
detailed in Section VI. On the contrary, other functions such
as computing the median cannot be easily expressed via PKG. 


Example. Let us examine the streaming top-k word count 
example using PKG. In this case, each word is tracked by 
two counters on two different PEIs. Each counter holds a
partial count for the word, while the total count is the sum
of the two partial counts. Therefore, the total memory usage
is 2 × K, i.e., O(K). Compare this result to SG where
the memory is O(WK). Partial counts are sent downstream
to an aggregator that computes the final result. For each
word, the application sends two counters, and the aggregator
performs a constant time aggregation. The total work for the
aggregation is O(K). Conversely, with SG the total work is
again O(WK). Compared to KG, the implementation with
PKG requires additional logic, some more memory and has
some aggregation overhead. However, it also provides a much
better load balance which maximizes the resource utilization
of the cluster. The experiments in Section V show that the
benefits outweigh the costs.
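
The word count example can be sketched in a few lines. This is a simplified single-source illustration, not the Storm implementation: `stable_hash`, `pkg_word_count`, and `aggregate` are hypothetical names, and the two hash functions are assumed to be MD5 with different seeds.

```python
import hashlib
from collections import Counter


def stable_hash(key, seed):
    """Deterministic seeded hash, standing in for the two hash functions of PoTC."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pkg_word_count(words, num_workers):
    """Route each word to the less loaded of its two candidates; keep partial counts."""
    worker_loads = [0] * num_workers
    partial_counts = [Counter() for _ in range(num_workers)]
    for word in words:
        first = stable_hash(word, 1) % num_workers
        second = stable_hash(word, 2) % num_workers
        target = first if worker_loads[first] <= worker_loads[second] else second
        worker_loads[target] += 1
        partial_counts[target][word] += 1
    return partial_counts


def aggregate(partial_counts):
    """Downstream aggregator: sum the (at most two) partial counts of each word."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


words = "the cat and the dog and the bird saw the cat".split() * 100
partial = pkg_word_count(words, num_workers=8)
total = aggregate(partial)
print(total.most_common(3))
```

Because counting is associative, summing the partial counters recovers the exact totals, and no word is ever tracked by more than two workers.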


B. Local Load Estimation 

PoTC requires knowledge of the load of each worker to 
take its routing decision. A DSPE is a distributed system, and, 
in general, sources and workers are deployed on different 
machines. Therefore, the load of each worker is not readily 
available to each source. 

Interestingly, we prove that no communication between 
sources and workers is needed to effectively apply PoTC. 
We propose a local load estimation technique, whereby each 
source independently maintains a local load-estimate vector 
with one element per worker. The load estimates are updated 
by using only local information of the portion of stream sent 
by each source. We argue that in order to achieve global load 
balance it is sufficient that each source independently balances 
the load it generates across all workers. 





The correctness of local load estimation directly follows
from our standard definition of load in Section II. The load
on a worker $L_i$ is simply the sum of the loads that each source
j imposes on the given worker: $L_i(t) = \sum_{j \in \mathcal{S}} L_i^j(t)$. Each
source j can keep an estimate of the load on each worker i
based on the load it has generated, $L_i^j$. As long as each source
keeps its own portion of load balanced, then the overall load 
on the workers will also be balanced. Indeed, the maximum 
overall load is at most the sum of the maximum load that each 
source sees locally. It follows that the maximum imbalance is 
also at most the sum of the local imbalances. 
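
The argument above can be illustrated with several sources that never exchange load information. The sketch is hypothetical: `PKGSource`, `run`, the round-robin split of the stream across sources, and the Zipf-like workload are assumptions made for the example.

```python
import hashlib
import random


def stable_hash(key, seed):
    """Deterministic seeded hash shared by all sources."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class PKGSource:
    """A source that tracks only the load it has itself sent to each worker."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.local_loads = [0] * num_workers

    def route(self, key):
        first = stable_hash(key, 1) % self.num_workers
        second = stable_hash(key, 2) % self.num_workers
        if self.local_loads[first] <= self.local_loads[second]:
            target = first
        else:
            target = second
        self.local_loads[target] += 1
        return target


def imbalance(worker_loads):
    return max(worker_loads) - sum(worker_loads) / len(worker_loads)


def run(stream, num_sources, num_workers):
    """Split the stream round-robin across sources; record the true worker loads."""
    sources = [PKGSource(num_workers) for _ in range(num_sources)]
    true_loads = [0] * num_workers
    for position, key in enumerate(stream):
        source = sources[position % num_sources]
        true_loads[source.route(key)] += 1
    return sources, true_loads


rng = random.Random(7)
weights = [1.0 / rank for rank in range(1, 1001)]
stream = rng.choices(range(1000), weights=weights, k=50_000)
sources, true_loads = run(stream, num_sources=5, num_workers=10)
local_sum = sum(imbalance(s.local_loads) for s in sources)
print(f"global imbalance: {imbalance(true_loads):.1f}, sum of local imbalances: {local_sum:.1f}")
```

The true load of each worker equals the sum of the local estimates, so the global imbalance can never exceed the sum of the imbalances that the individual sources observe.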

IV. Analysis 

We proceed to analyze the conditions under which PKG 
achieves good load balance. Recall from Section II that we
have a set $\mathcal{W}$ of n workers at our disposal and receive
a sequence of m messages $k_1, \ldots, k_m$ with values from a
key universe $\mathcal{K}$. Upon receiving the i-th message with value
$k_i \in \mathcal{K}$, we need to decide its placement among the workers;
decisions are irrevocable. We assume one message arrives per
unit of time. Our goal is to minimize the eventual maximum
load $L(m)$, which is the same as minimizing the imbalance
$I(m)$. A simple placement scheme such as shuffle grouping
provides an imbalance of at most one, but we would like to
limit the number of workers processing each key to $d \in \mathbb{N}^+$.

Chromatic balls and bins. We model our problem in the 
framework of balls and bins processes, where keys correspond 
to colors, messages to colored balls, and workers to bins. 
Choose d independent hash functions $\mathcal{H}_1, \ldots, \mathcal{H}_d : \mathcal{K} \to [n]$
uniformly at random. Define the Greedy-d scheme as follows:
at time t, the t-th ball (whose color is $k_t$) is placed on the bin
with minimum current load among $\mathcal{H}_1(k_t), \ldots, \mathcal{H}_d(k_t)$, i.e.,
$P_t(k_t) = \arg\min_{i \in \{\mathcal{H}_1(k_t), \ldots, \mathcal{H}_d(k_t)\}} L_i(t)$. Recall that with
key splitting there is no need to remember the choice for the
next time a ball of the same color appears.

Observe that when d = 1, each ball color is assigned to a
unique bin so no choice has to be made; this models hash-based
key grouping. At the other extreme, when $d \gg n \ln n$,
all n bins are valid choices, and we obtain shuffle grouping.
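
The role of the parameter d can be explored with a minimal sketch of the Greedy-d process on colored balls. The function `greedy_d`, the seeded MD5 hashes, and the Zipf-like key distribution are assumptions for illustration; they are not taken from the paper.

```python
import hashlib
import random


def stable_hash(key, seed):
    """Deterministic seeded hash; seed r plays the role of the r-th hash function."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def greedy_d(stream, num_bins, d):
    """Place each ball in the least loaded of the d bins its color hashes to."""
    bin_loads = [0] * num_bins
    candidates = {}  # cache of the candidate bins of each color
    for key in stream:
        if key not in candidates:
            candidates[key] = sorted({stable_hash(key, r) % num_bins for r in range(d)})
        target = min(candidates[key], key=bin_loads.__getitem__)
        bin_loads[target] += 1
    return bin_loads


def imbalance(bin_loads):
    return max(bin_loads) - sum(bin_loads) / len(bin_loads)


rng = random.Random(3)
weights = [1.0 / rank for rank in range(1, 1001)]
stream = rng.choices(range(1000), weights=weights, k=50_000)

n = 10
results = {d: imbalance(greedy_d(stream, n, d)) for d in (1, 2, 200)}
for d, value in results.items():
    print(f"d = {d:3d}: imbalance = {value:.1f}")
```

Setting d = 1 reproduces key grouping, while a value of d far above n ln n (here 200 for n = 10) makes every bin a candidate for essentially every key, so the process behaves like shuffle grouping.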

Key distribution. Finally, we assume the existence of an
underlying discrete distribution $\mathcal{D}$ supported on $\mathcal{K}$ from which
ball colors are drawn, i.e., $k_1, \ldots, k_m$ is a sequence of m
independent samples from $\mathcal{D}$. Without loss of generality, we
identify the set $\mathcal{K}$ of keys with $\mathbb{N}^+$ or, if $\mathcal{K}$ is finite of
cardinality $K = |\mathcal{K}|$, with $[K] = \{1, \ldots, K\}$. We assume
them ordered by decreasing probability: if $p_i$ is the probability
of drawing key i from $\mathcal{D}$, then $p_1 \ge p_2 \ge p_3 \ge \ldots$ and
$\sum_{i \in \mathcal{K}} p_i = 1$. We also identify the set $\mathcal{W}$ of bins with [n].

A. Imbalance with Partial Key Grouping

Comparison with standard problems. As long as we keep
getting balls of different colors, our process is identical to
the standard Greedy-d process of Azar et al. This occurs
with high probability provided that m is small enough. But for
sufficiently large m (e.g., when $m \ge \frac{1}{p_1}$), repeated keys will
start to arrive. Recall that for any number of choices $d \ge 2$, the
maximum load after throwing m balls of different colors
into n bins with the standard Greedy-d process is
$\frac{\ln \ln n}{\ln d} + \frac{m}{n} + O(1)$, so the imbalance is only
$\frac{\ln \ln n}{\ln d} + O(1)$. Unfortunately, such strong bounds
(independent of m) cannot apply to our setting. To gain some
intuition on what may go wrong, consider the following examples
where d = 2.

Note that for the maximum load not to be much larger than
the average load, the number of bins used must not exceed
$O(1/p_1)$, where $p_1$ is the maximum key probability. Indeed,
at any time we expect the two bins $\mathcal{H}_1(1), \mathcal{H}_2(1)$ to contain
together at least a $p_1$ fraction of all balls, just counting the
occurrences of a single key. Hence the expected maximum
load among the two grows at a rate of at least $p_1/2$ per unit of
time, while the overall average load increases by exactly $\frac{1}{n}$ per
unit of time. Thus, if $p_1 > 2/n$, the expected imbalance at time
m will be lower bounded by $\left(\frac{p_1}{2} - \frac{1}{n}\right) m$, which grows linearly
with m. This holds irrespective of the placement scheme used.

However, requiring $p_1 \le 2/n$ is not enough to prevent
imbalance $\Omega(m)$. Consider the uniform distribution over n keys.
Let $B = \bigcup_{i \le n} \{\mathcal{H}_1(i), \mathcal{H}_2(i)\}$ be the set of all bins that belong
to one of the potential choices for some key. As is well-known,
the expected size of B is $n - n\left(1 - \frac{1}{n}\right)^{2n} \approx n\left(1 - \frac{1}{e^2}\right)$. So all
n keys use only a $\left(1 - \frac{1}{e^2}\right) \approx 0.865$ fraction of all bins, and
roughly 0.135n bins will remain unused. In fact the imbalance
after m balls will be at least $\frac{m}{0.865 n} - \frac{m}{n} \approx 0.156 \frac{m}{n}$. The
problem is that most concrete instantiations of our two random
hash functions cause the existence of an “overpopulated” set B
of bins inside which the average bin load must grow faster than
the average load across all bins. (In fact, this case subsumes
our first example above, where B was $\{\mathcal{H}_1(1), \mathcal{H}_2(1)\}$.)
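
The constants 0.865 and 0.156 in this example can be reproduced numerically. The sketch below assumes the standard occupancy argument: with n keys and two uniform choices each, a given bin is missed by all 2n choices with probability (1 - 1/n)^(2n). The helper name `expected_used_bins` is illustrative.

```python
import math


def expected_used_bins(n):
    """Expected number of bins hit by 2n independent uniform choices among n bins."""
    return n - n * (1 - 1 / n) ** (2 * n)


n = 1000
fraction_used = expected_used_bins(n) / n

# Limit of the used fraction as n grows: 1 - 1/e^2.
limit = 1 - math.exp(-2)

# If all m balls land in a `limit` fraction of the bins, the average load inside
# that set is m / (limit * n); subtracting m / n gives this multiple of m / n.
imbalance_factor = 1 / limit - 1

print(f"used fraction for n={n}: {fraction_used:.4f}")
print(f"limit 1 - 1/e^2: {limit:.4f}")
print(f"imbalance lower bound in units of m/n: {imbalance_factor:.4f}")
```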

Finally, even in the absence of overpopulated bin subsets,
some inherent imbalance is due to deviations between the
empirical and true key distributions. For instance, suppose
there are two keys 1, 2 with equal probability $\frac{1}{2}$ and n = 4
bins. With constant probability, key 1 is assigned to bins 1, 2
and key 2 to bins 3, 4. This situation looks perfect because the
Greedy-2 choice will send each occurrence of key 1 to bins
1, 2 alternately so the loads of bins 1, 2 will always be equal up
to ±1. However, the number of balls with key 1 seen is likely
to deviate from m/2 by roughly $\Theta(\sqrt{m})$, so either the top
two or the bottom two bins will receive $m/4 + \Omega(\sqrt{m})$ balls,
and the imbalance will be $\Omega(\sqrt{m})$ with constant probability.

In the remainder of this section we carry out our analysis, 
which, broadly construed, asserts that the above are the only
impediments to achieving good balance.

Statement of results. We noted that once the number of bins
exceeds $2/p_1$ (where $p_1$ is the maximum key frequency), the
maximum load will be dominated by the loads of the bins to
which the most frequent key is mapped. Hence the main case
of interest is where $p_1 = O\left(\frac{1}{n}\right)$.

We focus on the case where the number of balls is large 
compared to the number of bins. The following results show 
that partial key grouping can significantly reduce the maximum
load (and the imbalance), compared to key grouping.




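Theorem 4.1: Suppose we use n bins and let $m \ge n^2$. Assume
a key distribution $\mathcal{D}$ with maximum probability $p_1 \le \frac{1}{5n}$.
Then the imbalance after m steps of the Greedy-d process
satisfies, with probability at least $1 - \frac{1}{n}$,

$$I(m) = \begin{cases} O\left(\frac{m}{n} \cdot \frac{\ln n}{\ln \ln n}\right), & \text{if } d = 1 \\ O\left(\frac{m}{n}\right), & \text{if } d \ge 2 \end{cases}$$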

As the next result shows, the bounds above are
best-possible.⁵

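Theorem 4.2: There is a distribution $\mathcal{D}$ satisfying the
hypothesis of Theorem 4.1 such that the imbalance after m steps
of the Greedy-d process satisfies, with probability at least
$1 - \frac{1}{n}$,

$$I(m) = \begin{cases} \Omega\left(\frac{m}{n} \cdot \frac{\ln n}{\ln \ln n}\right), & \text{if } d = 1 \\ \Omega\left(\frac{m}{n}\right), & \text{if } d \ge 2 \end{cases}$$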


We omit the proof of Theorem 4.2 (it follows by considering
a uniform distribution over 5n keys). The next section is
devoted to the proof of the upper bound, Theorem 4.1.


B. Proof 


Concentration inequalities. We recall the following results, 
which we need to prove our main theorem. 

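Theorem 4.3 (Chernoff bounds): Suppose $\{X_i\}$ is a finite
sequence of independent random variables with $X_i \in [0, M]$
and let $Y = \sum_i X_i$, $\mu = \sum_i E[X_i]$. Then for all $\beta \ge \mu$,

$$\Pr[Y \ge \beta] \le C(\mu, \beta, M),$$

where

$$C(\mu, \beta, M) = \exp\left(-\frac{\beta \ln\left(\frac{\beta}{e\mu}\right) + \mu}{M}\right).$$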

Theorem 4.4 (McDiarmid’s inequality): Let $X_1, \ldots, X_n$
be a vector of independent random variables and let f be a
function satisfying $|f(a) - f(a')| \le 1$ whenever the vectors
a and a' differ in just one coordinate. Then

$$\Pr[f(X_1, \ldots, X_n) > E[f(X_1, \ldots, X_n)] + \lambda] \le \exp\left(-\frac{2\lambda^2}{n}\right).$$


The $\mu_r$ measure of bin subsets. For every nonempty set of
bins $B \subseteq [n]$ and $1 \le r \le d$, define

$$\mu_r(B) = \sum \{\, p_i \mid \{\mathcal{H}_1(i), \ldots, \mathcal{H}_r(i)\} \subseteq B \,\}.$$

We will be interested in $\mu_1(B)$ (which measures the probability
that a random key from $\mathcal{D}$ will have its first choice inside
B) and $\mu_d(B)$ (which measures the probability that a random
key from $\mathcal{D}$ will have all its choices inside B). Note that
$\mu_1(B) = \sum_{j \in B} \mu_1(\{j\})$ and $\mu_d(B) \le \mu_1(B)$.

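Lemma 4.5: For every $B \subseteq [n]$, $E[\mu_1(B)] = \frac{|B|}{n}$ and, if
$p_1 \le \frac{1}{n}$,

$$\Pr\left[\mu_1(B) \ge \frac{|B|}{n}(e\lambda)\right] \le \exp(-|B| \lambda \ln \lambda).$$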

5 However, the imbalance can be much smaller than the worst-case bounds from
Theorem 4.1 if the probability of most keys is much smaller than $p_1$, which is the
case in many setups.


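Proof: The first claim follows from linearity of expectation
and the fact that $\sum_i p_i = 1$. For the second, let $|B| = k$.
Using Theorem 4.3, $\Pr\left[\mu_1(B) \ge \frac{k}{n}(e\lambda)\right]$ is at most

$$C\left(\frac{k}{n}, \frac{k}{n} e\lambda, p_1\right) \le \exp\left(-\frac{k}{n p_1} e\lambda \ln \lambda\right) \le \exp(-k \lambda \ln \lambda)$$

since $n p_1 \le 1$. ■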

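Lemma 4.6: For every $B \subseteq [n]$, $E[\mu_d(B)] = \left(\frac{|B|}{n}\right)^d$ and,
provided that $p_1 \le \frac{1}{5n}$,

$$\Pr\left[\mu_d(B) \ge \frac{|B|}{n}\right] \le \left(\frac{e|B|}{n}\right)^{5|B|}.$$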

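Proof: Again the first claim is easy. For the second, let
$|B| = k$. Using Theorem 4.3, $\Pr\left[\mu_d(B) \ge \frac{k}{n}\right]$ is at most

$$C\left(\left(\frac{k}{n}\right)^d, \frac{k}{n}, p_1\right) \le \exp\left(-\frac{k}{n p_1} \ln\left(\frac{n}{ek}\right)\right) \le \left(\frac{ek}{n}\right)^{5k}$$

since $n p_1 \le \frac{1}{5}$. ■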

Corollary 4.7: Assume $p_1 \le \frac{1}{5n}$, $d \ge 2$. Then, with high
probability,

$$\max\left\{ \frac{\mu_d(B)}{|B|/n} : B \subseteq [n], |B| \le \frac{n}{5} \right\} \le 1.$$


Proof: We use Lemma |4.5| and the union bound. The 
probability that the claim fails to hold is bounded by 


E 

|B|<n/5 


Pd(B) > 


< E 

k<n /5 

* E ( 

k<n /5 


5 k 


eny 

~k) 


ek 


5 k 



where we used (”) < (^) fc , valid for all k. ■ 

For a scheduling algorithm $\mathcal{A}$ and a set $B \subseteq [n]$ of bins,
write $L^{\mathcal{A}}_B(t) = \max_{j \in B} L_j(t)$ for the maximum load among
the bins in $B$ after $t$ balls have been processed by $\mathcal{A}$.

Lemma 4.8: Suppose there is a set $A \subseteq [n]$ of bins such
that for all $T \subseteq A$, $\mu_d(T) \le \frac{|T|}{n}$. Then $\mathcal{A}$ = Greedy-d satisfies
$L^{\mathcal{A}}_A(m) = O(\frac{m}{n}) + L^{\mathcal{A}}_{[n] \setminus A}(m)$ with high probability.

Proof: We use a coupling argument. Consider the following
two independent processes $\mathcal{P}$ and $\mathcal{Q}$: $\mathcal{P}$ proceeds as
Greedy-d, while $\mathcal{Q}$ picks the bin for each ball independently
at random from $[n]$ and increases its load. Consider any time $t$
at which the load vector is $\omega_t \in \mathbb{N}^n$ and $M_t$ is the
set of bins with maximum load. After handling the $t$-th ball,
let $X_t$ denote the event that $\mathcal{P}$ increases the maximum load
in $A$ because the new ball has all choices in $M_t \cap A$, and $Y_t$
denote the event that $\mathcal{Q}$ increases the maximum load in $A$.
Finally, let $Z_t$ denote the event that $\mathcal{P}$ increases the maximum
load in $A$ because the new ball has some choice in $M_t \cap A$
and some choice in $M_t \setminus A$, but the load of one of its choices
in $M_t \cap A$ is no larger. We identify these events with their
indicator random variables.

Note that the maximum load in $A$ at the end of Process $\mathcal{P}$
is $L^{\mathcal{P}}_A(m) = \sum_{t \in [m]} (X_t + Z_t)$, while at the end of Process $\mathcal{Q}$
it is $L^{\mathcal{Q}}_A(m) = \sum_{t \in [m]} Y_t$. Conditioned on any load vector $\omega_t$,
the probability of $X_t$ is

$$\Pr[X_t \mid \omega_t] = \mu_d(M_t \cap A) \le \frac{|M_t \cap A|}{n} \le \frac{|M_t|}{n} = \Pr[Y_t \mid \omega_t].$$

So $\Pr[X_t \mid \omega_t] \le \Pr[Y_t \mid \omega_t]$, which implies that for any
$b \in \mathbb{N}$, $\Pr[\sum_{t \in [m]} X_t \le b] \ge \Pr[\sum_{t \in [m]} Y_t \le b]$. But with
high probability, the maximum load of Process $\mathcal{Q}$ is $b =
O(m/n)$, so $\sum_t X_t = O(m/n)$ holds with at least the same
probability. On the other hand, $\sum_t Z_t \le L^{\mathcal{P}}_{[n] \setminus A}(m)$ because
each occurrence of $Z_t$ increases the maximum load on $A$, and
once a time $t$ is reached such that $L^{\mathcal{P}}_A(t) > L^{\mathcal{P}}_{[n] \setminus A}(m)$, event
$Z_t$ must cease to happen. Therefore $L^{\mathcal{P}}_A(m) = \sum_{t \in [m]} X_t +
\sum_{t \in [m]} Z_t \le O(m/n) + L^{\mathcal{P}}_{[n] \setminus A}(m)$, yielding the result. ■
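The two processes in the coupling argument can be simulated side by side. The sketch below is illustrative only: the sizes, the uniform key distribution, and the BLAKE2-based stand-ins for the hash functions are assumptions, chosen so that $p_1 = 1/1000 \le 1/(5n)$.

```python
import hashlib
import random

def choices(key: int, d: int, n: int) -> list:
    """The d hash choices H_1(key), ..., H_d(key), each a bin in [0, n)."""
    out = []
    for r in range(d):
        digest = hashlib.blake2b(f"{r}:{key}".encode(), digest_size=8).digest()
        out.append(int.from_bytes(digest, "big") % n)
    return out

def greedy_d(stream, d: int, n: int) -> list:
    """Process P: send each ball to the least loaded of its key's d choices."""
    load = [0] * n
    for key in stream:
        target = min(choices(key, d, n), key=lambda b: load[b])
        load[target] += 1
    return load

def random_bin(m: int, n: int, rng: random.Random) -> list:
    """Process Q: send each ball to a bin chosen uniformly at random."""
    load = [0] * n
    for _ in range(m):
        load[rng.randrange(n)] += 1
    return load

rng = random.Random(42)
n, m, num_keys = 20, 20_000, 1_000  # hypothetical sizes
stream = [rng.randrange(num_keys) for _ in range(m)]

load_p = greedy_d(stream, d=2, n=n)
load_q = random_bin(m, n, rng)
print("average load:          ", m / n)
print("max load, Greedy-2 (P):", max(load_p))
print("max load, random (Q):  ", max(load_q))
```

Comparing the two maxima against the average load $m/n$ gives an empirical feel for the $O(m/n)$ term that the proof transfers from $\mathcal{Q}$ to $\mathcal{P}$.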

Proof of Theorem 4.1: Let

$$A = \left\{ j \in [n] \;\middle|\; \mu_1(\{j\}) \ge \frac{3e}{n} \right\}.$$

Observe that every bin $j \notin A$ has $\mu_1(\{j\}) < \frac{3e}{n}$ and this
implies that, conditioned on any choice of hash functions, the
maximum load of all bins outside $A$ is $O(m/n)$ with high
probability.⁶ Therefore our task reduces to showing that the
maximum load of the bins in $A$ is $O(\frac{m}{n})$.

Consider the sequence $X_1, \ldots, X_K$ of random variables
given by $X_i = H_1(i)$, and let $f(X_1, X_2, \ldots, X_K) = |A|$ denote
the number of bins $j$ with $\mu_1(\{j\}) \ge \frac{3e}{n}$. By Lemma 4.5
(applied to each singleton $\{j\}$ with $\lambda = 3$),
$\mathbb{E}[|A|] = \mathbb{E}[f] \le \frac{n}{27}$. Moreover, the function $f$ satisfies
the hypothesis of Theorem 4.4. We conclude that, with high
probability, $|A| \le \frac{n}{5}$.

Now assume that the thesis of Corollary 4.7 holds, which
happens except with probability $o(1/n)$. Then we have that
for all $B \subseteq A$, $\mu_d(B) \le \frac{|B|}{n}$. Thus Lemma 4.8 applies to $A$.
This means that after throwing $m$ balls, the maximum load
among the bins in $A$ is $O(\frac{m}{n})$, as we wished to show. ■


V. Evaluation 


We assess the performance of our proposal by using both 
simulations and a real deployment. In so doing, we answer the 
following questions: 

Q1: What is the effect of key splitting on PoTC?

Q2: How does local estimation compare to a global oracle?

Q3: How robust is Partial Key Grouping?

Q4: What is the overall effect of Partial Key Grouping
on applications deployed on a real DSPE?


A. Experimental Setup 

Datasets. Table I summarizes the datasets used. We use two
main real datasets, one from Wikipedia and one from Twitter.
These datasets were chosen for their large size, their different
degree of skewness, and because they are representative
of Web and online social network domains. The Wikipedia
dataset (WP)⁷ is a log of the pages visited during a day in
January 2008. Each visit is a message and the page’s URL
represents its key. The Twitter dataset (TW) is a sample of
tweets crawled during July 2012. Each tweet is parsed and
split into its words, which are used as the key for the message.


TABLE I: Summary of the datasets used in the experiments:
number of messages, number of keys and percentage of
messages having the most frequent key ($p_1$).

Dataset        Symbol   Messages   Keys    p_1 (%)
Wikipedia      WP       22M        2.9M    9.32
Twitter        TW       1.2G       31M     2.67
Cashtags       CT       690k       2.9k    3.29
Synthetic 1    LN₁      10M        16k     14.71
Synthetic 2    LN₂      10M        1.1k    7.01
Live Journal   LJ       69M        4.9M    0.29
Slashdot0811   SL₁      905k       77k     3.28
Slashdot0902   SL₂      948k       82k     3.11

An additional Twitter dataset comprises a sample of tweets
crawled in November 2013. The keys for the messages are
the cashtags in these tweets. A cashtag is a ticker symbol
used in the stock market to identify a publicly traded company,
preceded by the dollar sign (e.g., $AAPL for Apple). Popular
cashtags change from week to week. This dataset allows us to
study the effect of a shift of skew in the key distribution.

We also generate two synthetic datasets (LN₁, LN₂) with
keys following a log-normal distribution, a commonly used 
heavy-tailed skewed distribution EH . The parameters of the 
distribution ($\mu_1 = 1.789$, $\sigma_1 = 2.366$; $\mu_2 = 2.245$, $\sigma_2 = 1.133$) come
from an analysis of Orkut, and try to emulate workloads from 
the online social network domain urn 

Finally, we experiment on three additional datasets comprised
of directed graphs⁸ (LJ, SL₁, SL₂). We use the edges
in the graph as messages and the vertices as keys. These 
datasets are used to test the robustness of PKG to skew in 
partitioning the stream at the sources, as explained next. 
They also represent a different kind of application domain: 
streaming graph mining. 

Simulation. We process the datasets by simulating the DAG
presented in Figure 1. The stream is composed of timestamped
keys that are read by multiple independent sources (S)
via shuffle grouping, unless otherwise specified. The sources
forward the received keys to the workers (W) downstream. 
In our simulations we assume that the sources perform data 
extraction and transformation, while the workers perform data 
aggregation, which is the most computationally expensive part 
of the DAG. Thus, the workers are the bottleneck in the DAG 
and the focus for the load balancing. 

6 This is by majorization with the process that just throws every ball to the
first choice; see, e.g., Azar et al.
7 http://www.wikibench.eu/?page_id=60
8 http://snap.stanford.edu/data

B. Experimental Results

Fig. 2: Fraction of average imbalance with respect to total number of messages for each dataset, for different number of
workers and number of sources.


TABLE II: Average imbalance when varying the number of
workers for the Wikipedia and Twitter datasets.

Dataset      WP                             TW
W            5      10     50     100       5      10     50     100
PKG          0.8    2.9    5.9e5  8.0e5     0.4    1.7    2.74   4.0e6
Off-Greedy   0.8    0.9    1.6e6  1.8e6     0.4    0.7    7.8e6  2.0e7
On-Greedy    7.8    1.4e5  1.6e6  1.8e6     8.4    92.7   1.2e7  2.0e7
PoTC         15.8   1.7e5  1.6e6  1.8e6     2.2e4  5.1e3  1.4e7  2.0e7
Hashing      1.4e6  1.7e6  2.0e6  2.0e6     4.1e7  3.7e7  2.4e7  3.3e7


Q1. We measure the imbalance in the simulations when using
the following techniques:



H: Hashing, which represents standard key grouping (KG) 
and is our main baseline. We use a 64-bit Murmur hash 
function to minimize the probability of collision. 

PoTC: Power of two choices without using key splitting, i.e., 
traditional PoTC applied to key grouping. 

On-Greedy: Online greedy algorithm that picks the least 
loaded worker to handle a new key. 

Off-Greedy: Offline greedy sorts the keys by decreasing
frequency and executes On-Greedy.

PKG: PoTC with key splitting.

Note that PKG is the only method that uses key splitting.
Off-Greedy knows the whole distribution of keys so it
represents an unfair comparison for online algorithms.
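The difference between hashing, PoTC without key splitting, and PKG can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator used in the experiments: the Zipf-like stream, the BLAKE2-based hash `h` (standing in for Murmur), and the use of the final imbalance (maximum load minus average load) instead of the average imbalance over time are all choices made here.

```python
import hashlib
import random

def h(key: int, seed: int, w: int) -> int:
    """Seeded hash of a key onto one of w workers (stand-in for Murmur)."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

def least_loaded(load, candidates):
    """Candidate worker with the smallest load (the first one wins ties)."""
    return min(candidates, key=lambda i: load[i])

def simulate(stream, w: int, scheme: str) -> list:
    load = [0] * w
    table = {}  # key -> worker; used only by PoTC, which never splits a key
    for key in stream:
        c1, c2 = h(key, 1, w), h(key, 2, w)
        if scheme == "H":        # key grouping: a single hash function
            target = c1
        elif scheme == "PoTC":   # choose once per key, then stick to it
            target = table.setdefault(key, least_loaded(load, (c1, c2)))
        elif scheme == "PKG":    # choose again for every message
            target = least_loaded(load, (c1, c2))
        else:
            raise ValueError(scheme)
        load[target] += 1
    return load

def imbalance(load) -> float:
    """Maximum load minus average load."""
    return max(load) - sum(load) / len(load)

rng = random.Random(7)
num_keys, m, w = 1_000, 50_000, 10  # hypothetical sizes
weights = [1.0 / rank for rank in range(1, num_keys + 1)]
stream = rng.choices(range(num_keys), weights=weights, k=m)

results = {s: simulate(stream, w, s) for s in ("H", "PoTC", "PKG")}
for s, load in results.items():
    print(f"{s:5s} imbalance = {imbalance(load):.1f}")
```

PoTC needs the routing table `table` to remember where each key went, whereas PKG keeps no per-key state at the source, which is one reason key splitting is easier to deploy in a distributed setting.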

Table II shows the results of the comparison on the two main
datasets WP and TW. Each value is the average imbalance
measured throughout the simulation. As expected, hashing
performs the worst, creating a large imbalance in all cases. While
PoTC performs better than hashing in all the experiments, it is
outclassed by On-Greedy on TW. On-Greedy performs very
close to Off-Greedy, which is a good result considering that
it is an online algorithm. Interestingly, PKG performs even
better than Off-Greedy. Relaxing the constraint of KG makes it
possible to achieve a load balance comparable to offline algorithms.

We conclude that PoTC alone is not enough to guarantee
good load balance, and key splitting is fundamental not only
to make the technique practical in a distributed system, but
also to make it effective in a streaming setting. As expected,
increasing the number of workers also increases the average
imbalance. The behavior of the system is binary: either well
balanced or largely imbalanced. The transition between the
two states happens when the number of workers surpasses the
limit $O(1/p_1)$ described in Section IV, which happens around
50 workers for WP and 100 for TW.


Q2. Given the aforementioned results, we focus our attention 
on PKG henceforth. So far, it still uses global information about 
the load of the workers when deciding which choice to make. 
Next, we experiment with local estimation, i.e., each source 
performs its own estimation of the worker load, based on the 
sub-stream processed so far. 

We consider the following alternatives: 

G: PKG with global information of worker load.

L: PKG with local estimation of worker load and different
number of sources, e.g., L₅ denotes S = 5.

LP: PKG with local estimation and periodic probing of worker
load every $T_p$ minutes. For instance, L₅P₁ denotes S = 5
and $T_p$ = 1. When probing is executed, the local estimate
vector is set to the actual load of the workers.
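The three variants can be sketched as a single routine. The assumptions here are mine: sources receive messages round-robin (a simple form of shuffle grouping), probing is triggered every `probe_every` messages instead of every $T_p$ minutes, all sources probe at the same instant, and `h` is a BLAKE2-based stand-in for the hash functions.

```python
import hashlib
import random

def h(key: int, seed: int, w: int) -> int:
    """Seeded hash of a key onto one of w workers."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

def run_pkg(stream, w: int, s: int, probe_every=None) -> list:
    """PKG with s sources, each routing on its own local load estimate.

    s == 1 behaves like the global variant G; probe_every=None is variant L;
    a positive probe_every is variant LP (measured in messages, not minutes).
    """
    true_load = [0] * w
    local = [[0] * w for _ in range(s)]  # one estimate vector per source
    for t, key in enumerate(stream):
        if probe_every and t > 0 and t % probe_every == 0:
            # Probing: every local estimate is reset to the actual load.
            local = [list(true_load) for _ in range(s)]
        est = local[t % s]               # round-robin shuffle to the sources
        c1, c2 = h(key, 1, w), h(key, 2, w)
        target = c1 if est[c1] <= est[c2] else c2
        est[target] += 1                 # the source only sees its own traffic
        true_load[target] += 1
    return true_load

def imbalance(load) -> float:
    return max(load) - sum(load) / len(load)

rng = random.Random(11)
num_keys, m, w = 1_000, 50_000, 10  # hypothetical sizes
weights = [1.0 / rank for rank in range(1, num_keys + 1)]
stream = rng.choices(range(num_keys), weights=weights, k=m)

g = run_pkg(stream, w, s=1)
l5 = run_pkg(stream, w, s=5)
l5p = run_pkg(stream, w, s=5, probe_every=10_000)
for name, load in (("G", g), ("L5", l5), ("L5P", l5p)):
    print(f"{name:4s} imbalance = {imbalance(load):.1f}")
```

With shuffle grouping each source sees a similar sample of the stream, so balancing its own share of the traffic tends to balance the total as well; that is the intuition behind relying on local estimates alone.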

Figure 2 shows the average imbalance (normalized to the
size of the dataset) with different techniques, for different
number of sources and workers, and for several datasets. The
baseline (H) always imposes very high load imbalance on the
workers. Conversely, PKG with local estimation (L) always has
a lower imbalance. Furthermore, the difference from the global
variant (G) is always less than one order of magnitude. Finally,
this result is robust to changes in the number of sources.

Figure 3 displays the imbalance of the system through time
$I(t)$ for TW, WP and CT, 5 sources, and for W = 10 and
50. Results for W = 5 and W = 100 are omitted as they are
similar to W = 10 and W = 50, respectively. PKG with global
information (G) and its variant with local estimation (L₅)
perform best. Interestingly, even though both G and L achieve
very good load balance, their choices are quite different. In
an experiment measuring the agreement on the destination of
each message, G and L have only 47% Jaccard overlap. Hence,
L reaches a local minimum which is very close in value to the
one obtained by G, although different. Also in this case, good
balance can only be achieved up to a number of workers that
depends on the dataset. When that number is exceeded, the
imbalance increases rapidly, as seen in the cases of WP and
partially for CT for W = 50, where all techniques lead to the
same high load imbalance.

Fig. 3: Fraction of imbalance through time for different
datasets, techniques, and number of workers, with S = 5.



Fig. 4: Fraction of average imbalance with uniform and 
skewed splitting of the input keys on the sources when using 
the LJ graph. 


Finally, we compare our local estimation strategy with a
variant that makes use of periodic probing of workers’ load
every minute (L₅P₁). Probing removes any inconsistency in the
load estimates that the sources may have accumulated. However,
interestingly, this technique does not improve the load
balance, as shown in Figure 3. Even increasing the frequency
of probing does not reduce imbalance (results not shown in the
figure for clarity). In conclusion, local information is sufficient
to obtain good load balance, therefore it is not necessary to
incur the overhead of probing.

Q3. To operationalize this question, we use the directed graph
datasets. We use KG to distribute the messages to the sources
to test the robustness of PKG to skew in the sources, i.e.,
when each source forwards an uneven part of the stream. We
simulate a simple application that computes a function of the
incoming edges of a vertex (e.g., in-degree, PageRank). The
input key for the source PE is the source vertex id, while the
key sent to the worker PE is the destination vertex id, that is,
the source PE inverts the edge. This schema projects the
out-degree distribution of the graph on sources, and the in-degree
distribution on workers, both of which are highly skewed.

Figure 4 shows the average imbalance for the experiments
with a skewed split of the keys to sources for the LJ social
graph (results on SL₁ and SL₂ are similar to LJ and are
omitted due to space constraints). For comparison, we include
the results when the split is performed uniformly using shuffle
grouping of keys on sources. On average, the imbalance
generated by the skew on sources is similar to the one obtained
with uniform splitting. As expected, the imbalance slightly
increases as the number of sources and workers increase, but,
in general, it remains at very low absolute values.

To answer Q3, we additionally experiment with drift in
the skew distribution by using the cashtag dataset (CT). The
bottom row of Figure 3 demonstrates that all techniques
achieve a low imbalance, even though the change of key
popularity through time generates occasional spikes.

In conclusion, PKG is robust to skew on the sources, and can
therefore be chained to key grouping. It is also robust to the
drift in key distribution common to many real-world streams.


Q4. We implement and test our technique on the streaming
top-k word count example, and perform two experiments to
compare PKG, KG, and SG on WP. We choose word count
as it is one of the simplest possible examples, thus limiting
the number of confounding factors. It is also representative
of many data mining algorithms such as the ones described in
Section VI (e.g., counting frequent items or co-occurrences
of feature-class pairs). Due to the requirement of real-world
deployment on a DSPE, we ignore techniques that require
coordination (i.e., PoTC and On-Greedy). We use a topology
configuration of a single source along with 9 workers (counters)
running on a Storm cluster of 10 virtual servers. We report
overall throughput, end-to-end latency, and memory usage.

In the first experiment, we emulate different levels of CPU
consumption per key by adding a fixed delay to the processing.
We prefer this solution over implementing a specific application
in order to be able to control the load on the workers.
We choose a range that is able to bring our configuration to
a saturation point, although the raw numbers would vary for
different setups. Even though real deployments rarely operate
at saturation point, PKG allows better resource utilization,
therefore supporting the same workload on a smaller number
of machines. In this case, the minimum delay (0.1ms)
corresponds approximately to reading 400kB sequentially from
memory, while the maximum delay (1ms) corresponds to
$\frac{1}{10}$-th of a disk seek.⁹ Nevertheless, even more expensive
tasks exist: parsing a sentence with NLP tools can take up to 500ms.¹⁰

9 http://brenocon.com/dean_perf.html
10 http://nlp.stanford.edu/software/parser-faq.shtml#n

Fig. 5: (a) Throughput for PKG, SG and KG for different CPU
delays, (b) Throughput for PKG and SG vs. average memory
for different aggregation periods.

The system does not perform aggregation in this setup,
as we are only interested in the raw effect on the workers.
Figure 5(a) shows the throughput achieved when varying the
CPU delay for the three partitioning strategies. Regardless of
the delay, SG and PKG perform similarly, and their throughput
is higher than KG. The throughput of KG is reduced by ≈ 60%
when the CPU delay increases tenfold, while the impact on
PKG and SG is smaller (≈ 37% decrease). We deduce that
reducing the imbalance is critical for clusters operating close
to their saturation point, and that PKG is able to handle
bottlenecks similarly to SG and better than KG. In addition,
the imbalance generated by KG translates into longer latencies
for the application. When the workers are heavily loaded, the
average latency with KG is up to 45% larger than with PKG.
Finally, the benefits of PKG over SG regarding memory are
substantial. Overall, PKG (3.6 M counters) requires about 30%
more memory than KG (2.9 M counters), but about half the
memory of SG (7.2 M counters).
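The memory comparison comes down to how many distinct (worker, key) counters each grouping creates. The following sketch counts them on a hypothetical Zipf-like stream; the stream, the round-robin model of shuffle grouping, and the BLAKE2-based hash `h` are assumptions, so the numbers it prints are not those of the Storm deployment.

```python
import hashlib
import random

def h(key: int, seed: int, w: int) -> int:
    """Seeded hash of a key onto one of w workers."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

def count_counters(stream, w: int, scheme: str) -> int:
    """Number of distinct (worker, key) pairs, i.e. partial counters in memory."""
    load = [0] * w
    counters = set()
    for t, key in enumerate(stream):
        if scheme == "KG":
            target = h(key, 1, w)
        elif scheme == "PKG":
            c1, c2 = h(key, 1, w), h(key, 2, w)
            target = c1 if load[c1] <= load[c2] else c2
        elif scheme == "SG":
            target = t % w  # round-robin stand-in for shuffle grouping
        else:
            raise ValueError(scheme)
        load[target] += 1
        counters.add((target, key))
    return len(counters)

rng = random.Random(3)
num_keys, m, w = 5_000, 100_000, 9  # 9 workers, as in the deployment above
weights = [1.0 / rank for rank in range(1, num_keys + 1)]
stream = rng.choices(range(num_keys), weights=weights, k=m)

kg = count_counters(stream, w, "KG")
pkg = count_counters(stream, w, "PKG")
sg = count_counters(stream, w, "SG")
print(f"KG: {kg}  PKG: {pkg}  SG: {sg}")
```

KG needs exactly one counter per distinct key, PKG at most two, and SG up to W; rare keys that appear only once use a single counter under every scheme, which is why the measured PKG overhead can stay well below the factor of two.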

In the second experiment, we fix the CPU delay to 0.4ms
per key, as it is the saturation point for KG in our setup. We
activate the aggregation of counters at different time intervals
$T$ to emulate different application policies for when to receive
up-to-date top-k word counts. In this case, PKG and SG need
additional memory compared to KG to keep partial counters.
Shorter aggregation periods reduce the memory requirements,
as partial counters are flushed often, at the cost of a higher
number of aggregation messages. Figure 5(b) shows the
relationship between throughput and memory overhead for PKG
and SG. The throughput of KG is shown for comparison. For all
values of aggregation period, PKG achieves higher throughput
than SG, with lower memory overhead and similar average
latency per message. When the aggregation period is above
30s, the benefits of PKG compensate for its extra overhead and its
overall throughput is higher than when using KG.


VI. Applications 

PKG is a novel programming primitive for stream partitioning
and not every algorithm can be expressed with it. In
general, all algorithms that use shuffle grouping can use PKG to 
reduce their memory footprint. In addition, many algorithms 
expressed via key grouping can be rewritten to use PKG in 
order to get better load balancing. In this section we provide 
a few such examples of common data mining algorithms, and 
show the advantages of PKG. Henceforth, we assume that each 
message contains a data point for the application, e.g., a feature 
vector in a high-dimensional space. 


A. Naive Bayes Classifier 

A naive Bayes classifier is a probabilistic model that assumes
independence of features. It estimates the probability of
a class C given a feature vector X by using Bayes’ theorem.
In practice, the classifier works by counting the frequency of
co-occurrence of each feature and class values.

The simplest way to parallelize this algorithm is to spread
the counters across several workers via vertical parallelism,
i.e., each feature is tracked independently in parallel. Following
this design, the algorithm can be implemented by the
same pattern used for the KG example in Section II-A. Sparse
datasets often have a skewed distribution of features, e.g., for
text classification. Therefore, this implementation suffers from
the same load imbalance, which PKG solves.

Horizontal parallelism can also be used to parallelize the
algorithm, i.e., by shuffling messages to separate workers.
This implementation uses the same pattern as the DAG in the
SG example in Section II-A. The count for a single
feature-class pair is distributed across several workers, and needs
to be combined at prediction (query) time. This combination
requires broadcasting the query to all the workers, as a feature
can be tracked by any worker. This implementation, while
balancing the work better than key grouping, requires an
expensive query stage that may be affected by stragglers.

PKG tracks each feature on two workers and avoids replicating
counters on all workers. Furthermore, the two workers
are deterministically assigned for each feature. Thus, at query
time, the algorithm needs to probe only two workers for each
feature, rather than having to broadcast it to all the workers.
The resulting query phase is less expensive and less sensitive
to stragglers than with shuffle grouping.
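The query pattern can be illustrated with a small in-memory sketch. The class `PKGCounters`, the BLAKE2-based hash `h`, and the toy training set are hypothetical; the sketch only shows how feature-class counts split across two deterministic workers are recombined at query time.

```python
import hashlib
from collections import Counter

def h(key: str, seed: int, w: int) -> int:
    """Seeded hash of a feature onto one of w workers."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

class PKGCounters:
    """Feature-class co-occurrence counters partitioned with PKG."""

    def __init__(self, w: int):
        self.w = w
        self.load = [0] * w
        self.workers = [Counter() for _ in range(w)]

    def choices(self, feature: str):
        return h(feature, 1, self.w), h(feature, 2, self.w)

    def update(self, feature: str, label: str) -> None:
        c1, c2 = self.choices(feature)
        target = c1 if self.load[c1] <= self.load[c2] else c2
        self.load[target] += 1
        self.workers[target][(feature, label)] += 1

    def query(self, feature: str, label: str) -> int:
        # Probe only the two candidate workers; set() avoids double counting
        # when both hash functions map the feature to the same worker.
        return sum(self.workers[i][(feature, label)]
                   for i in set(self.choices(feature)))

# Toy training set: (features, class label).
data = [
    (["free", "offer", "click"], "spam"),
    (["meeting", "offer"], "ham"),
    (["free", "free", "click"], "spam"),
    (["meeting", "agenda"], "ham"),
]
counters = PKGCounters(w=4)
truth = Counter()
for features, label in data:
    for feature in features:
        counters.update(feature, label)
        truth[(feature, label)] += 1

print(counters.query("free", "spam"), truth[("free", "spam")])
```

Because the two candidate workers are a deterministic function of the feature, the query side needs no routing table: it recomputes the same two hashes used at update time.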


B. Streaming Parallel Decision Tree 

A decision tree is a classification algorithm that uses a tree-like
model where nodes are tests on features, branches are
possible outcomes, and leaves are class assignments.

Ben-Haim and Tom-Tov JT] propose an algorithm to build 
a streaming parallel decision tree that uses approximated 
histograms to find the test value for continuous features. Messages
are shuffled among W workers. Each worker generates
histograms independently for its sub-stream, one histogram 
for each feature-class-leaf triplet. These histograms are then 
periodically sent to a single aggregator that merges them to 
get an approximated histogram for the whole stream. The 
aggregator uses this final histogram to grow the model by 
taking split decisions for the current leaves in the tree. Overall, 
the algorithm keeps $W \times D \times C \times L$ histograms, where D is
the number of features, C is the number of classes, and L is
the current number of leaves.

The memory footprint of the algorithm depends on W, so
it is impossible to fit larger models by increasing the parallelism.
Moreover, the aggregator needs to merge $W \times D \times C$
histograms each time a split decision is tried, and merging the
histograms is one of the most expensive operations.

Instead, PKG reduces both the space complexity and aggregation
cost. If applied on the features of each message, a single
feature is tracked by two workers, with an overall cost of only
$2 \times D \times C \times L$ histograms. Furthermore, the aggregator needs to
merge only two histograms per feature-class-leaf triplet. This
scheme makes it possible to alleviate memory pressure by adding more
workers, as the space complexity does not depend on W.
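As a worked example of the two counts, take hypothetical values $W = 20$ workers, $D = 100$ features, $C = 2$ classes, and $L = 8$ leaves (these numbers are illustrative and do not come from the experiments):

```latex
\begin{align*}
\text{shuffle grouping:} \quad & W \times D \times C \times L = 20 \times 100 \times 2 \times 8 = 32\,000 \text{ histograms},\\
\text{PKG:} \quad & 2 \times D \times C \times L = 2 \times 100 \times 2 \times 8 = 3\,200 \text{ histograms},\\
\text{ratio:} \quad & \frac{W \times D \times C \times L}{2 \times D \times C \times L} = \frac{W}{2} = 10.
\end{align*}
```

The saving grows linearly with the parallelism level, since only the shuffle-grouping count depends on $W$.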

C. Heavy Hitters and Space Saving 

The heavy hitters problem consists in finding the top-k most
frequent items occurring in a stream. The SPACESAVING [23]
algorithm solves this problem approximately in constant time
and space. Recently, Berinde et al. have shown that
SPACESAVING is space-optimal, and how to extend its guarantees
to merged summaries. This result allows for parallelized
execution by merging partial summaries built independently
on separate sub-streams.

In this case, the error bound on the frequency of a single
item depends on a term representing the error due to the
merging, plus another term which is the sum of the errors
of each individual summary for a given item $i$:

$$|f_i - \hat{f}_i| \le \Delta_f + \sum_{j=1}^{W} \Delta_j,$$

where $f_i$ is the true frequency of item $i$ and $\hat{f}_i$ is the estimated
one, each $\Delta_j$ is the error from summarizing each sub-stream,
while $\Delta_f$ is the error from summarizing the whole stream,
i.e., from merging the summaries.

Observe that the error bound depends on the parallelism 
level W. Conversely, by using KG, the error for an item 
depends only on a single summary, thus it is equivalent to 
the sequential case, at the expense of poor load balancing. 

Using PKG we achieve both benefits: the load is balanced 
among workers, and the error for each item depends on the 
sum of only two error terms, regardless of the parallelism level. 
However, the individual error bounds may depend on W. 
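A compact sketch of the idea follows. The `SpaceSaving` class is a textbook-style version that finds the minimum counter by a linear scan (so updates are not constant time here), and the routing function, stream, and sizes are assumptions made for illustration; it is not the implementation evaluated in the paper.

```python
import hashlib
import random
from collections import Counter

class SpaceSaving:
    """Keeps at most `capacity` counters; each carries an overestimation error."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.count = {}
        self.error = {}

    def offer(self, item) -> None:
        if item in self.count:
            self.count[item] += 1
        elif len(self.count) < self.capacity:
            self.count[item], self.error[item] = 1, 0
        else:
            # Replace the item with the smallest counter and inherit its count.
            victim = min(self.count, key=self.count.get)
            floor = self.count.pop(victim)
            self.error.pop(victim)
            self.count[item], self.error[item] = floor + 1, floor

    def estimate(self, item) -> int:
        return self.count.get(item, 0)

def h(key: int, seed: int, w: int) -> int:
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

rng = random.Random(5)
num_keys, m, w, capacity = 500, 20_000, 8, 50  # hypothetical sizes
weights = [1.0 / rank for rank in range(1, num_keys + 1)]
stream = rng.choices(range(num_keys), weights=weights, k=m)
truth = Counter(stream)

single = SpaceSaving(capacity)               # sequential baseline
summaries = [SpaceSaving(capacity) for _ in range(w)]
load = [0] * w
for item in stream:
    single.offer(item)
    c1, c2 = h(item, 1, w), h(item, 2, w)    # PKG routing
    target = c1 if load[c1] <= load[c2] else c2
    load[target] += 1
    summaries[target].offer(item)

def pkg_estimate(item) -> int:
    """Merge only the two summaries that can contain the item."""
    return sum(summaries[i].estimate(item) for i in {h(item, 1, w), h(item, 2, w)})

top = truth.most_common(1)[0][0]
print("true:", truth[top], "single:", single.estimate(top), "PKG:", pkg_estimate(top))
```

Whatever the value of `w`, `pkg_estimate` reads two summaries per item, which mirrors the statement that the error for an item is the sum of only two error terms.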

VII. Related Work 

Various works in the literature either extend the theoretical 
results from the power of two choices, or apply them to the 
design of large-scale systems for data processing. 

Theoretical results. Load balancing in a DSPE can be seen 
as a balls-and-bins problem, where m balls are to be placed 
in n bins. The power of two choices has been extensively 
researched from a theoretical point of view for balancing the 
load among machines [4], [20]. Previous results consider each
ball equivalent. For a DSPE, this assumption holds if we map
balls to messages and bins to servers. However, if we map
balls to keys, more popular keys should be considered to be
heavier. ED tackle the case where each ball has a weight 
drawn independently from a fixed weight distribution X. They 
prove that, as long as X is “smooth”, the expected imbalance 
is independent of the number of balls. However, the solution 
assumes that X is known beforehand, which is not the case in 
a streaming setting. Thus, in our work we take the standard 
approach of mapping balls to messages. 


Another assumption common in previous works is that there 
is a single source of balls. Existing algorithms that extend 
PoTC to multiple sources execute several rounds of intra¬ 
source coordination before taking a decision ns he E3i. 
Overall, these techniques incur a significant coordination 
overhead, which becomes prohibitive in a DSPE that handles 
thousands of messages per second. 

Stream processing systems. Existing load balancing tech¬ 
niques for DSPEs are analogous to key grouping with rebalanc¬ 
ing 0 [8] 0 [Jol ClD El. In our work, we consider operators 
that allow replication and aggregation, similar to a standard 
combiner in map-reduce, and show that it is sufficient to 
balance load among two replicas based on local load estimation.
We refer to Section III-A for a more extensive discussion of
key grouping with rebalancing. Flux monitors the load of 
each operator, ranks servers by load, and migrates operators 
from the most loaded to the least loaded server, from the 
second most loaded to the second least loaded, and so on 0. 
Aurora* and Medusa propose policies for migrating operators
in DSPEs and federated DSPEs [8]. Borealis uses a similar
approach but it also aims at reducing the correlation of load 
spikes among operators placed on the same server 0- This 
correlation is estimated by using a finite set of load samples 
taken in the recent past. Gedik ma developed a partitioning 
function (a hybrid between explicit mapping and consistent 
hashing of items to servers) for stateful data parallelism in 
DSPEs that leverages item frequencies to control migration 
cost and imbalance in the system. Similarly, Balkesen et al. 
HD proposed frequency-aware hash-based partitioning to 
achieve load balance. Castro Fernandez et al. 02 propose 
integrating common operator state management techniques for 
both checkpointing and migration. 

Other distributed systems. Several storage systems use consistent
hashing to allocate data items to servers [25]. Consistent
hashing essentially produces a random allocation and is designed
to deal with systems where the set of servers available
varies over time. In this paper, we propose replicating DSPE
operators on two servers selected at random. One could use
consistent hashing also to select these two replicas, using the
replication technique used by Chord [26] and other systems.

Sparrow m is a stateless distributed job scheduler that 
exploits a variant of the power of two choices [24]. It employs
batch probing, along with late binding, to assign $m$ tasks of
a job to the least loaded of $d \times m$ randomly selected workers
($d \ge 1$). Sparrow considers only independent tasks that can be
executed by any worker. In DSPEs, a message can only be sent 
to the workers that are accumulating the state corresponding 
to the key of that message. Furthermore, DSPEs deal with 
messages that arrive at a much higher rate than Sparrow’s 
fine-grained tasks, so we prefer to use local load estimation. 

In the domain of graph processing, several systems have 
been proposed to solve the load balancing problem, e.g., 
Mizan (28) , GPS (29) , and xDGP 01)1 . Most of these systems 
perform dynamic load rebalancing at runtime via vertex migra¬ 
tion. We have already discussed why rebalancing is impractical 
in our context in Section mi 




Finally, SkewTune solves the problem of load balancing
in MapReduce-like systems by identifying and redistributing
the unprocessed data from the stragglers to other workers.
Techniques such as SkewTune are a good choice for batch 
processing systems, but cannot be directly applied to DSPEs. 

VIII. Conclusion 

Despite being a well-known problem in the literature, load 
balancing has not been exhaustively studied in the context of 
distributed stream processing engines. Current solutions fail 
to provide satisfactory load balance when faced with skewed 
datasets. To solve this issue, we introduced Partial Key
Grouping, a new stream partitioning strategy that allows
better load balance than key grouping while incurring less 
memory overhead than shuffle grouping. Compared to key 
grouping, PKG is able to reduce the imbalance by up to several 
orders of magnitude, thus improving throughput and latency 
of an example application by up to 45%. 

This work gives rise to further interesting research questions.
Is it possible to achieve good load balance without
forgoing atomicity of processing of keys? What are the
necessary conditions, and how can it be achieved? In particular,
can a solution based on rebalancing be practical? And
in a larger perspective, which other primitives can a DSPE
offer to express algorithms effectively while making them run
efficiently? While most DSPEs have settled on just a small set,
the design space still remains largely unexplored.